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Abstract 



There are two main algorithms which have been considered for fitting constrained 
marginal models to discrete data; these are studied in detail and their properties clar- 
ified. The two procedures are shown to be equivalent, in the sense that the updates 
they produce are identical, each method being advantageous in different circumstances. 
An extension is provided to one of the algorithms for modelling the effect of exoge- 
nous individual-level covariates, and an application of the method to likelihood-based 
. estimation under Li-penalties is also considered. 
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$ '. 1 Introduction 

The application of marginal constraints to multi-way contingency tables has been much in- 
vestigated in the last 20 years; see, for example, McCullagh and Nelder (1989), Liang et al. 
(1992), Glonek and McCullagh (1995), Agresti (2002), Bergsma et al. (2009). Bergsma 
OV and Rudas (2002) introduced marginal log-linear parameters (MLLPs), which generalize 

other discrete parameterizations including ordinary log-linear parameters and Glonek and 
McCullagh's multivariate logistic parameters. The flexibility of this family of parame- 
terizations enables their application to many popular classes of conditional independence 
models, and especially to graphical models (Forcina et al., 2010, Rudas et al., 2010, Evans 
and Richardson, 2011). Bergsma and Rudas (2002) show that, under certain conditions, 
models defined by linear constraints on MLLPs are curved exponential families. However, 
naive algorithms for maximum likelihood estimation with MLLPs face several challenges: 
in general, there are no closed form equations for computing raw probabilities from MLLPs, 
so direct evaluation of the log-likelihood can be time consuming. In addition, MLLPs are 
not necessarily variation independent and, as noted by Bartolucci et al. (2007), ordinary 
Newton-Raphson or Fisher scoring methods may become stuck by producing updated 
estimates which are incompatible. 

Lang (1996) and Bergsma (1997), amongst others, have tried to adapt a general al- 
gorithm introduced by Aitchison and Silvey (1958) for constrained maximum likelihood 
estimation to the context of marginal models. In this paper we provide an explicit for- 
mulation of Aitchison and Silvey's algorithm, and show that an alternative method due 
to Colombi and Forcina (2001) is equivalent; we term this second approach the regression 
algorithm. The regression algorithm may be preferable if the number of constraints is 
large, particularly in the presence of individual-level covariates, in which case Aitchison 
and Silvey's approach is often infeasible. A variation of these algorithms, which can be 
used to fit marginal log-linear models under Li-penalties, is also given. 
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In Section 2 we review marginal log-linear models and their basic properties. In Sec- 
tion 3 we formulate the two algorithms, show that they are equivalent and discuss their 
properties. In Section 4 we describe applications to individual-level covariates, and finally 
Section 5 considers similar methods for L\ -constrained estimation. 

2 Notations and preliminary results 

Let Xj, j = 1, . . . , d be categorical random variables taking values in {1, . . . , Cj}. The 
joint distribution of X±, . . . , is determined by the vector of joint probabilities 7r, of 
dimension t = Y\i c ji whose entries correspond to cell probabilities, and are assumed to be 
strictly positive; we take the entries of 7r to be in lexographic order. Further, let y denote 
the vector of cell frequencies with entries arranged in the same order as iv. We write the 
multinomial log-likelihood in terms of the canonical parameters as 

1(6) = y'GO -nlog[i;exp(G0)] 

(see, for example, Bartolucci et al., 2007, p. 699); here n is the sample size, It a vector of 
length t whose entries are all 1, and G a t x (t — 1) full rank design matrix which determines 
the log-linear parameterization. The mapping between the canonical parameters and the 
joint probabilities may be expressed as 

log(Tr) = GO - l t \og[l' t exp(G6>)] & = L log(Tr), 

where L is a (t — 1) x t matrix of row contrasts and LG = It—i- 

Expressions for the score vector, s, and the expected information matrix, F, with 
respect to 6 take the form 

s = G'{y-niv) and F = nG'ftG, 

here fl = diag(7r) — 7r7r'. 

2.1 Marginal log-linear parameters 

Marginal log-linear parameters (MLLPs) enable the simultaneous modelling of several 
marginal distributions (see, for example, Bergsma et al., 2009, Chapters 2 and 4) and 
the specification of suitable conditional independencies within marginal distributions of 
interest (see Evans and Richardson, 2011). In the following let rj denote an arbitrary 
vector of MLLPs; it is well known that this can be written as 

rj = Clog(MTr), 

where C is a suitable matrix of row contrasts, and M a matrix of 0's and l's producing 
the appropriate margins (see, for example, Bergsma et al., 2009, Section 2.3.4). 

Bergsma and Rudas (2002) have shown that if a vector of MLLPs r\ is complete and 
hierarchical, two properties defined below, models determined by linear restrictions on rj 
are curved exponential families, and thus smooth. Like ordinary log-linear parameters, 
MLLPs may be grouped into interaction terms involving a particular subset of variables; 
each interaction term must be defined within a margin of which it is a subset. 

Definition 1. A vector of MLLPs r] is called complete if every possible interaction is 
defined in precisely one margin. 

Definition 2. A vector of MLLPs t] is called hierarchical if there is a non- decreasing or- 
dering of the margins of interest Mi, . . . , M s such that, for each j = 1, . . . s, no interaction 
term which is a subset of Mj is defined within a later margin. 
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3 Two algorithms for fitting marginal log-linear models 



Here we describe the two main algorithms used for fitting models of the kind described 
above. 

3.1 An adaptation of Aitchison and Silvey's algorithm 

Aitchison and Silvey (1958) study maximum likelihood estimation under non-linear con- 
straints in a very general context, and show that, under certain conditions, the maximum 
likelihood estimates exist and are asymptotically normal; they also outline an algorithm 
for computing those estimates. Suppose we wish to maximize 1(0) subject to h(0) = 0, a 
set of r non-linear constraints, under the assumption that the second derivative of h(0) ex- 
ists and is bounded. Aitchison and Silvey propose to maximize the function 1(0) + h(0)'X, 
where A is a vector of Lagrange multipliers; this leads to the system of equations 

s(0) + H(0)\ = 

(1) 

h(0) = 0, 

where is the ML estimate and H the derivative of h! with respect to 0. Since these 
are non-linear equations, they suggest an iterative algorithm which proceeds as follows: 
suppose that at the current iteration we have 0q, a value reasonably close to 0. Replace 
s and h with first order approximations around 0q] in addition replace H(0) with H(0q) 
and the second derivative of the log-likelihood with — F, minus the expected information 
matrix. The resulting equations, after rearrangement, may be written in matrix form as 

o-o \ = (f -h y 1 ( s 

A J V -H'q J \h 

where sq, Fq, Hq denote the corresponding quantities evaluated at 0q. To compute a 
solution, Aitchison and Silvey (1958) exploit the structure of the partitioned matrix, while 
Bergsma (1997) solves explicitly for by substitution; in both cases, if we are uninterested 
in the Lagrange multipliers, we get the updating equation 

= O + F-'so - F^Ho^F^Hoy'^F^so + h ). (2) 

As noted by Bergsma (1997), the algorithm does not always converge unless some sort of 
step length adjustment is introduced. 

Linearly constrained marginal models are defined by K'rj = 0, where K is a matrix 
of full column rank r < t — 1. The multinomial likelihood is a regular exponential family, 
so these models may be fitted using the smooth constraint h(0) = K'r}(0) = 0, which 
implies that 

r _ dh_ df^dii _ , , 
30' Or,' 80' 



where 



^ = {dw) 1 = [C^g(M^M^g(K)G]-\ 



^ drj' 

Remark 1. In the equation above, the inverse derivative is used because cannot be writ- 
ten as a closed form function of r). In this expression we have replaced ft with diag(7r) by 
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exploiting the fact that rj is a homogeneous function of iv (see Bergsma et al., 2009, Sec- 
tion 2.3.4)- If the constrained model were not smooth then at singular points the Jacobian 
matrix R would not be invertible, implying that H is not of full rank and thus violating 
a crucial assumption in Aitchison and Silvey (1958). It has been shown (Bergsma and 
Rudas, 2002, Theorem 3) that completeness is a necessary condition for smoothness. 

3.2 A regression algorithm 

By noting that the Aitchison-Silvey algorithm is essentially based on a quadratic approxi- 
mation of 1(6) with a linear approximation of the constraints, Colombi and Forcina (2001) 
designed an algorithm which they believed to be equivalent to the original, though no 
formal argument was provided; this equivalence is proven in Proposition 1 below. Recall 
that, by elementary linear algebra, there exists a (t — 1) x (t — r — 1) design matrix X of 
full column rank such that K'X = 0, from which it follows that r\ = X(3 for a vector of 
t — r — 1 unknown parameters (3. Let s = R's and F = R'FR respectively denote the 
score and information relative to 77; then the regression algorithm consists of alternating 
the following steps: 

1. update the estimate of (3 by 

^-/3 = (X / F X)- 1 X / (F 7o + 5o), (3) 
where 7 = ri - X(3 ; 

2. update 6 by 

6-6 o = R o [X0-(3 o )- lo }. (4) 

Proposition 1. The updating equation in (2) is equivalent to the combined steps given in 
(3) and (4). 

A proof of this result is given in A. 

Remark 2. From the form of the updating equations (2), (3) and (4) it is clear that 
Proposition 1 remains true if identical step length adjustments are applied to the 6 updates. 
This does not hold, however, if adjustments are applied to the (3 updates of the regression 
algorithm. 

3.2.1 Derivation of the regression algorithm 

In a neighbourhood of 6q, approximate 1(6) by a quadratic function Q having the same 
information matrix and the same score vector as 1(6), 

l(6)^Q(d) = -^(6-t )'F (6-to), where t = 6 + F Q 1 s . 

Now compute a linear approximation of 6 with respect to f3 in a neighbourhood of 6q, 

6-6 ^R (Xf3-r lo ); (5) 

substituting into the expression for Q we obtain a quadratic function in (3. By adding and 
subtracting RqX(3$ and setting 8 = (3 — (3 , we have 

Q((3) = ~[R XS - # 7o " F^soj'FolRoXS - i? 7o " *V W 

A weighted least square solution of this local maximization problem gives (3) ; substitution 
into (5) gives (4). 



4 



Remark 3. The choice of X is somewhat arbitrary because the design matrix XA, where 
A is any non-singular matrix, implements the same set of constraints as X . In many cases 
an obvious choice for X is provided by the context; otherwise, if we are not interested in 
the interpretation of (3, any numerical complement of K will do. 

3.3 Comparison of the two algorithms 

The Aitchison-Silvey algorithm requires the inversion of the (t — 1) x (t — 1) matrix Fq 
and the r x r matrix (H' Fq 1 Hq); however if we choose, for example, G to be the identity 
matrix of size t with the first column removed, an explicit inverse exists 

F- 1 = [n(diag(7V) - tVtV')]- 1 = n' 1 [diag^)" 1 + l t _il{_i/(l - , 

where tt denotes the vector 7v with the first element removed. 

In the regression algorithm one needs to invert the (t — 1) x (t — 1) matrix Rq 1 and 
the (t - r - 1) x (t - r - 1) matrix (X'F Q X). Although the regression algorithm requires 
two inversions at each step, based upon our implementations it can be slightly faster than 
the Aitchison-Silvey approach when r is considerably larger than (t — l)/2, i.e. there are 
many more constraints than parameters to be estimated. The particular advantage of the 
regression algorithm comes in the context of individual- level covariates, as we demonstrate 
in Section 4. 

To give an idea of the time required, we used a random sample from a 4 5 table with 
2,000 observations; there were 182 empty cells out of 1,024. Two conditional independence 
models were fitted, one with 567 and one with 900 constraints. On an average PC, the 
Aitchison-Silvey algorithm took 26s to run the first case and 39s for the second, while the 
regression algorithm took 44s and 27s respectively. 

3.4 Properties of the algorithms 

Detailed conditions for the asymptotic existence of the maximum likelihood estimates of 
constrained models are given by Aitchison and Silvey (1958); see also Bergsma and Rudas 
(2002), Theorem 8. Much less is known about existence for finite sample sizes where 
estimates might fail to exist because of observed zeros. In this case, some elements of tt 
may converge to 0, leading the Jacobian matrix R to become ill-conditioned and making 
the algorithm unstable. 

Concerning the convergence properties of their algorithm, Aitchison and Silvey (1958, 
p. 827) noted only that it could be seen as a modified Newton algorithm and that similar 
modifications had been used successfully elsewhere. However, it is clear from the form of 
the updating equations that if the algorithms converge to some 9*, then the constraints 
h(9*) = are satisfied, and 9* is a stationary point of the constrained likelihood. 

To ensure that the stationary point reached by the algorithm is indeed a maximum, 
one could look at the eigenvalues of the observed information with respect to (3; if these 
have mixed signs, then we know that the algorithm has converged to a saddle point. An 
efficient formula for computing the observed information matrix is given in B. Since the 
log-likelihood of constrained marginal models is not, in general, concave, the algorithm 
may converge to a local maximum. It might therefore be advisable to apply the algorithm 
to a range of starting values, in order to check that the maximum is a global one. 

3.5 Extension to more general constraints 

Occasionally, one may wish to fit general constraints on marginal probabilities without the 
need to define a marginal log-linear parameterization; an interesting example is provided 
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by the relational models of Klimova et al. (2011). They consider constrained models of the 
form h(0) = Alog(Mir) = 0, where A is an arbitrary matrix of full row rank. Redefine 

Bh 

K' = — = Adiag(M7r)- 1 MOG 
do 

and note that, because A is not a matrix of row contrasts, h is not homogeneous in it, 
and thus the simplifications mentioned in Remark 1 do not apply. If the resulting model 
is smooth, implying that K is a matrix of full column rank r everywhere in the parameter 
space, it can be fitted with the ordinary Aitchison-Silvey algorithm. We now show how 
the same model can also be fitted by a slight extension of the regression algorithm. 

Let 6q be a starting value and Kq be a right inverse of K' at 6q; consider a first order 
expansion of the constraints 

h = h + K' o (0 - ) = K' (K h + e-e ) = o 

and let Xq be a matrix that spans the orthogonal complement of Kq. Then, with the 
same order of approximation, 

K h + e-6 = X f3; 

by solving the above equation for 6 — 6q and substituting into the quadratic approximation 
of the log-likelihood, we obtain an updating equation similar to (3): 

0-P o = (X'oFoXoy'X'olso + F (K h - X (3 )}. 
4 Modelling the effect of individual-level covariates 

When exogenous covariates are available, it may be of interest to allow the marginal 
log-linear parameters ij to depend upon individualdevel covariates as in a linear model: 
t} i = Xif3; here the matrix Xi specifies how certain marginal log-linear parameters depend 
on individual specific information, in addition to structural restrictions such as conditional 
independence. Let y^ i = 1, . . . , n, be a vector of length t with a 1 in the entry corre- 
sponding to the response pattern of the ith individual, and all other values 0; let y be 
the vector obtained by stacking the vectors yi one below the other. Alternatively, if the 
sample size is large and the covariates can take only a limited number of distinct values, 
y { may contain the frequency table of the response variables within the sub-sample of 
subjects with the ith configuration of the covariates; in this case n denotes the number 
of strata. This arrangement avoids the need to construct a joint contingency table of 
responses and covariates; in addition the covariate configurations with no observations are 
simply ignored. 

In either case, to implement the Aitchison-Silvey approach, stack the Xi matrices one 
below the other into the matrix X, and let K span the orthogonal complement of X; 
as before, we have to fit the set of constraints K't) = 0. However, while the size of (3 is 
fixed and, say, equal to q, K will be of size n(t — 1) x [n(t — 1) — q], and therefore with n 
moderately large, the problem is almost infeasible. 

In the regression formulation the complexity of the problem is at worst linear inn, as 
we now show. Let 6i denote the vector of canonical parameters for the ith individual and 
l(0i) = y'GOi — \og[l' t exp(G0j)] be the contribution to the log-likelihood. Note that Xi 
need not be of full column rank, a property which must instead hold for the matrix X; for 
this reason our assumptions are much weaker than those used by Lang (1996), and allow 
for more flexible models. 
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Both the quadratic and the linear approximations must be applied at the individual 
level; thus we set Oi — Oio = Riq(X,i(3 — T] i0 ), and the log-likelihood becomes 

n \ n 

E - "2 E^°( X ^ - 7io) - ^^'^[^(JCitf - 7l0 ) - ^To^o], 

i=l i=l 

where -y i0 = r] i0 - Xi(3 , s { = G'(j/j - 7v { ) and Fj = G'fiiG. 
Direct calculations lead to the updating expression 

h - /3 = (e [E x *(™^ + W , 

where Wi = R' i0 G'n i0 GRi . 

The complexity of this procedure is therefore at worst linear in the number of obser- 
vations, and potentially much less if the covariates are discrete. 

For an application of the method described above to social mobility tables see Dard- 
anoni et al. (2012). Social mobility tables are cross classifications of subjects according to 
their social class (columns) and that of their fathers (rows). The hypothesis of equality 
of opportunity would imply that the social class of sons is independent of that of their 
fathers. Mediating covariates may induce positive dependence between the social classes 
of fathers and sons, leading to the appearance of limited social mobility; to assess this, 
Dardanoni et al. (2012) fitted a model where the vector of marginal parameters for each 
son-father pair was allowed to depend on individual covariates, including the father's age, 
the results of cognitive and non-cognitive test scores taken by the son at school, and his 
academic qualifications. The analysis, based on the UK's National Child Development 
Survey, indicated that positive association was present even after controlling for a rich 
set of covariates, thus demonstrating inequality of opportunity for the sons based on their 
fathers' occupations. 



5 Li-penalized parameters 

Evans (2011) shows that, in the context of marginal log-linear parameters, consistent model 
selection can be performed using the so-called adaptive lasso. Since the adaptive lasso uses 
Li-penalties, we might therefore be interested in relaxing the equality constraints discussed 
above to a penalization framework, in which we maximize the penalized log-likelihood 

t-i 

m = i(e)-J2»j\vj(0)\, 

3=1 

for some vector of penalties v = (uj) > 0. 

The advantage of penalties of this form is that one can obtain parameter estimates 
which are exactly zero (Tibshirani, 1996), and therefore perform model selection without 
the need to fit many models separately. For now, assume that no equality constraints hold 
for r/, so we can take X to be the identity, and (3 = r\. This gives the quadratic form 

Q(rj) = ~[Ro(v ~ Vo) ~ F^soYFolRoiv - Vo ) - F^s } 
approximating 1(0) as before. Then (f> is approximated by 

4>(v) = -\[Ro(v ~ Vo) ~ F o 1 8 ]'F [Ro(ri - rj ) - F^sq] - E u j\Vj\, 

3 
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and we can attempt to maximize (j) by repeatedly solving the sub-problem of maximizing 
(p. Now, because the quadratic form Q(rj) is concave and differentiable, and the absolute 
value function | • | is concave, coordinate-wise ascent is guaranteed to find a local maximum 
of (j) (Tseng, 2001). Coordinate-wise ascent cycles through j = 1, 2, . . . , t — 1, at each step 
minimizing 

--[Ro{v ~ Vo) ~ F o 1 so]'F [Ro(v ~ Vo) ~ F^sq] ~ u j\Vj\ 

with respect to rjj, with 771, ... , rjj-i,rjj + i, . . . , r/ t _i held fixed. This is solved just by taking 

rjj = sign(77)(|?/| - Vj)+, 

where a+ = max{a, 0}, and f]j minimizes Q with respect to r/j (Friedman et al., 2010). This 
approach to the sub-problem may require a large number of iterations, but it is extremely 
fast in practice because each step is so simple. If the overall algorithm converges, then 
by a similar argument to that of Section 3.4, together with the fact that <p has the same 
supergradient as <p at rj = r] Q , we see that we must have reached a local maximum of <f>. 

Since penalty selection for the lasso and adaptive lasso is typically performed using 
computationally intensive procedures such as cross validation, its implementation makes 
fast algorithms such as the one outlined above essential. 

A Proof of Equivalence of the Algorithms 

Lemma 1. For matrices X and K, let the columns of X span the orthogonal complement 
of the space spanned by the columns of K. Then for any symmetric and positive definite 
matrix W 

W- 1 - W- l K{K'W- l K)- l K'W- 1 = XiX'WX^X'. (6) 

Proof Let U = W'^K and V = W^ 2 X and note that U'V = K' X = 0, then (6) 
follows from the identity U{U'U)- l U' + V{V'V)- l V' = I. □ 

Proof of Proposition 1. Recall s = R's and F = R'FR, and note that 

H'F X H = K'R T X F ' x (i? T X )'K = K'F X K\ 
using this in the updating equation (2) enables us to rewrite it as 



R l {6 - O ) = [F 1 - F 1 K(K'F 1 K)- 1 K'F 1 }s + 
- F^K(K'F^K)KF^F oVo . 



(7) 



Set W = Fq and note that (6) may be substituted into the first component of (7) and 
that its equivalent formulation 

F^K^K'Fq 1 K)- 1 K'F 1 = Fq 1 - X{X'FqX)- 1 X' 

may be substituted into the second component, giving 

R \6 - 6> ) = XiX'FoX^X'so -rj + X(X' 'FqX)- 1 X' 'F rj . 

This is easily seen to be the same as combining equations (3) and (4). □ 
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B Computation of the observed information matrix 

Lemma 2. Suppose that A is a p x q matrix, and that y, b, x and u are column vectors 
with respective lengths q, p, k and r. Then if A and b are constant, 

^dia g (Av)6 = diag(6)A^^. (8) 



Proof. 



^diag(Ay)& = A diag(Ay)6 |?i 

= (diag(Ay ul )6, • • • diag(Ay uh )b) — 

= (diag(6)Ay ul , • • • diag(b)Ay uh ) — 

. . dy du 
= d iag (6)A— — . 



□ 



The observed information matrix is minus the second derivative of the log-likelihood 
with respect to (3, that is 







dl{0) 
d(3' [ dp 



.±,X'R'G'(y-nn) 



^X'R'G'(y - mr) 



RX 



f)F}' 

nX'R'G'ftGRX - X'—^G'iy - niv)RX. 



Since s depends on 6 through both (y — niv) and R, the above derivative has two main 
components, where the one obtained by differentiating (y — nir) is minus the expected 
information. Using the well known expression for the derivative of an inverse matrix, it 
only remains to compute 

x W G ' {y ~ n7v)RX = x ' R '^k^ R>G>{y ~ n7v)RX = A ^e^ bRX 

where A = X'R' and b = R'G'(y — niv), giving 

= AC , / 9[diag( 7 r)MMiag(M7r)- 1 ] c/b ^ Y _ 
89' 



By two applications of (8), this is 

AG' [ diag(M' diag(M7r)- 1 C / b) 



- diag(7r)M / diag(C'b) diag(M7r)- 2 M] CtGRX. 
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